We create two word clouds from the pleased and displeased feedback that customers left for the Casio AMW320R-1EV watch bought at www.amazon.com. The cloud for the pleased feedback is shown below. Evidently, “watch” is the most frequent word in the pleased feedback, followed by “one” and “the”. Other frequent words include “price”, “time”, “battery”, “band”, “display”, “great” and “years”.
This suggests that customers care most about price, time, battery and similar aspects, and that they rate those aspects well.
#1.1.1
data<-read.table("Five.txt",header=F, sep='\n')
data$doc_id=1:nrow(data)
colnames(data)[1]<-"text"
mycorpus <- Corpus(DataframeSource(data))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1:2)]
wordcloud(d$word,d$freq, scale=c(8,.7),min.freq=3,max.words=60, random.order=F, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
title(main = "Word Cloud for Five.txt", font.main=1.5)
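One possible reason why “the” still appears in the cloud, although English stop words are removed, is letter case: the stop word list is lowercase, so a capitalised “The” would survive a case-sensitive removal and be counted once the terms are lowercased. The base R sketch below illustrates this effect on three made-up reviews. The reviews, the short stop word list and the helper `count_terms` are illustrative assumptions and do not reproduce the tm pipeline.

```r
# Made-up reviews, used only for illustration
reviews <- c("The watch is great", "the price is great", "The band broke")

# Short lowercase stop word list (assumption; the report uses stopwords("english"))
stop_words <- c("the", "is")

# Hypothetical helper: count terms after stop word removal.
# lower_first = TRUE lowercases the text before the stop words are removed.
count_terms <- function(texts, stop_words, lower_first) {
  if (lower_first) texts <- tolower(texts)
  tokens <- unlist(strsplit(texts, "[^[:alpha:]]+"))
  tokens <- tokens[nzchar(tokens)]
  kept <- tokens[!(tokens %in% stop_words)]   # case-sensitive comparison
  sort(table(tolower(kept)), decreasing = TRUE)
}

freq_case_sensitive <- count_terms(reviews, stop_words, lower_first = FALSE)
freq_lowercased     <- count_terms(reviews, stop_words, lower_first = TRUE)

print(freq_case_sensitive)  # "the" is still counted (it came from "The")
print(freq_lowercased)      # "the" no longer appears
```

In the tm pipeline, a comparable step would be to lowercase the corpus before `removeWords`, for example with `tm_map(mycorpus, content_transformer(tolower))`.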
The next word cloud is created from the displeased feedback. The word “watch” is still the most frequent, and “the”, “time” and “casio” are also very frequent. Other common words include “battery”, “replace”, “back”, “amazon” and “just”.
The words “time”, “price” and “battery” appear here as well, which supports our assumption that these are the aspects customers care about. However, the feedback on these aspects may be negative or mixed, since the cloud also contains words such as “stopped”, “replacement”, “just” and “stop”.
#1.1.2
data<-read.table("OneTwo.txt",header=F, sep='\n')
data$doc_id=1:nrow(data)
colnames(data)[1]<-"text"
mycorpus <- Corpus(DataframeSource(data))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1:2)]
wordcloud(d$word,d$freq, scale=c(8,.7),min.freq=3,max.words=60, random.order=F, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
title(main = "Word Cloud for OneTwo.txt", font.main=1.5)
After that, we create phrase nets with the connector words “am, is, are, was, were” and the 60 most common words for both the pleased and the displeased feedback. Notice that “watch” and “I” are the most frequent words in both nets. In the pleased feedback, “I” and “watch” are connected to many positive adjectives such as “awesome”, “unbeatable”, “durable” and “happy”.
Phrase net
However, in the phrase net for the negative feedback, “I” and “watch” are connected to negative adjectives such as “disappointed”, “sad” and “defective”. In addition, apart from “watch”, “alarm” is the word most often connected to “defective”.
Phrase net
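The phrase nets above link pairs of words that are joined by a connector word. As a rough illustration of that idea, the base R sketch below collects the word before and the word after each connector. The helper `phrase_pairs` and the example sentences are assumptions made for this sketch; it is not the tool that produced the phrase nets in this report.

```r
# Hypothetical helper: return (left, right) word pairs joined by a connector word
phrase_pairs <- function(texts,
                         connectors = c("am", "is", "are", "was", "were")) {
  pairs <- lapply(texts, function(txt) {
    # Split into lowercase words; keep apostrophes inside words
    words <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
    words <- words[nzchar(words)]
    # Positions of connectors that have a neighbour on both sides
    idx <- which(words %in% connectors)
    idx <- idx[idx > 1 & idx < length(words)]
    data.frame(left = words[idx - 1], right = words[idx + 1],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Made-up sentences, used only for illustration
example_texts <- c("I am happy with it",
                   "The watch is durable",
                   "The alarm was defective")

pairs <- phrase_pairs(example_texts)
print(pairs)
```

Counting how often each pair occurs, for example with `table(paste(pairs$left, pairs$right))`, would give the edge weights of such a net.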
Afterwards, we create several word trees for the feedback. The first tree is built from the pleased feedback. It is difficult to extract useful information from this word tree.
Word tree
Then we create new word trees rooted at key words such as “price”, “band” and “battery”, taken from the word cloud and the phrase net. The phrase “great price” is clearly the property that people mention most often. Many customers are satisfied with this good-looking watch because it is cheap compared with other watches.
Word tree
Remarkably, some satisfied customers also mention that the battery is not good.
Word tree
On the other hand, we create several word trees for the displeased feedback. Although the sample of negative feedback is not very large, it points to some possible issues with the Casio AMW320R-1EV. The first tree is the original tree (key word “seems”); many customers mention that the analog part of the watch sometimes does not work.
Word tree
The following tree searches for sentences that include “alarm”, as the phrase net suggests. Several customers find the alarm unusable and defective. These customers may also dislike the chronometer of the watch.
Word tree
When we choose “display” from the phrase net as the key word, we see that, besides the chronometer display, the digital display is also problematic for some customers.
However, “great price” is frequently mentioned in the displeased feedback as well.
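A word tree is rooted at a key word and branches over the words around it. A simple approximation of one level of such a tree is to count the words that directly precede the key word, which is how a phrase like “great price” would surface. The helper `preceding_words` and the sentences below are assumptions made for this sketch and are not taken from the review files.

```r
# Hypothetical helper: count the words that directly precede a key word
preceding_words <- function(texts, key) {
  prev <- unlist(lapply(texts, function(txt) {
    words <- tolower(unlist(strsplit(txt, "[^[:alpha:]']+")))
    words <- words[nzchar(words)]
    idx <- which(words == key)
    idx <- idx[idx > 1]          # the key word needs a predecessor
    words[idx - 1]
  }))
  sort(table(prev), decreasing = TRUE)
}

# Made-up sentences, used only for illustration
example_texts <- c("Great price for a watch",
                   "great price and a comfortable band",
                   "the price went up")

before_price <- preceding_words(example_texts, "price")
print(before_price)
```

The same function could be pointed at the words that follow the key word by replacing `idx - 1` with `idx + 1` and adjusting the boundary check.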
The satisfied customers say that the watch does not stop, that it is a great-looking, robust and awesome watch for the money, that the band is comfortable, that it suits both casual and sport use, and that it combines analog and digital displays. Most customers are happy with the accuracy and water resistance of the watch. Thus, the Casio AMW320R-1EV leaves these customers with a good impression.
The unsatisfied customers are angry about the poor luminosity of the display of the Casio AMW320R-1EV. Other problems, such as the cheap rubber band, the striking analog, the watch stopping and the watch getting stuck in alarm mode, make these customers unwilling to buy the watch again.
Both satisfied and unsatisfied customers mention “great price” and battery problems for the Casio AMW320R-1EV.
In more detail, although most people consider the Casio AMW320R-1EV a well-priced watch that is worth buying, these graphs point to problems with the striking analog, the uncomfortable design of the digital and chronometer displays, and the alarm. The designers and the after-sales service department at CASIO should pay attention to the repair and replacement of such products.
#create shared data
olive_shared <- SharedData$new(olive)
#2.1
scatter_olive <- plot_ly(olive_shared, type = "scatter", x = ~eicosenoic, y = ~linoleic)
scatter_olive
There is a group of oils with low eicosenoic acid values, namely 1, 2 and 3.
#2.2
bar_olive <-plot_ly(olive_shared, x=~Region)%>%add_histogram()%>%layout(barmode="overlay")
bscols(widths=c(2, NA),filter_slider("R", "Stearic", olive_shared, ~stearic),
subplot(scatter_olive,bar_olive)%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%hide_legend())
Plot 2.2.2
Based on the persistent brushing (see picture 2.2.1), we find that the group with low eicosenoic values belongs to regions 2 (South) and 3 (Sardinia island), whereas the group with higher eicosenoic values belongs to region 1 (North). Additionally, the linoleic values of the oils from region 2 are higher than those of the oils from region 3.
When using the slider (picture 2.2.2), we see that lower eicosenoic values tend to go with higher stearic values and vice versa. To check this conclusion, we draw the following scatter plot:
ggplot(olive, aes(y = linoleic, x = stearic))+
geom_point()+
geom_smooth(method = "loess")
The plot supports our conclusion. Apart from some outliers in the bottom left corner, it shows a negative relationship between eicosenoic and stearic values. Another conclusion is that the stearic value does not depend on the region. In this task, we use the selection (persistent brushing), connection (linked bar and scatter plots) and filtering (slider) operators.
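A rank correlation can back up a relationship that is read off a scatter plot. The sketch below wraps `cor()` in a hypothetical helper `rank_cor` and runs it on a made-up data frame; the numbers are not taken from olive.csv. On the real data, the same call could be applied to `olive` with the columns `stearic`, `eicosenoic` and `linoleic`.

```r
# Hypothetical helper: Spearman rank correlation between two columns of a data frame
rank_cor <- function(df, x, y) {
  cor(df[[x]], df[[y]], method = "spearman")
}

# Made-up values, used only for illustration (not from olive.csv)
toy <- data.frame(
  stearic    = c(200, 220, 240, 260, 280),
  eicosenoic = c(30, 25, 22, 10, 5)
)

rho <- rank_cor(toy, "stearic", "eicosenoic")
print(rho)  # -1, because eicosenoic strictly decreases as stearic increases here
```

A value close to -1 indicates a monotonically decreasing relationship, a value close to 0 indicates none.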
#2.3
scatter_olive2 <- plot_ly(olive_shared, type = "scatter", x = ~arachidic, y = ~linolenic)%>%
add_markers(color = I("lightblue"))
subplot(scatter_olive,scatter_olive2)%>%
highlight(on="plotly_select", dynamic=T, persistent=T, opacityDim = I(1))%>%hide_legend()
Plot 2.3.1
In this task, we link the scatter plot of eicosenoic~linoleic (plot1) with the scatter plot of arachidic~linolenic (plot2). Brushing plot2 (picture 2.3.1), we see that almost all outliers in plot2 fall among the low-eicosenoic points in plot1. The plot2 outliers are grouped together in plot1, and most of them are from region 3.
#2.4
p<-ggparcoord(olive, columns = c(4:11))
d<-plotly_data(ggplotly(p))%>%group_by(.ID)
d1<-SharedData$new(d, ~.ID, group="olive")
p1<-plot_ly(d1, x=~variable, y=~value)%>%
add_lines(line=list(width=0.3))%>%
add_markers(marker=list(size=0.3),
text=~.ID, hoverinfo="text")
p2<-plot_ly(d1, x=~factor(Region) )%>%add_histogram()%>%layout(barmode="overlay")
# Dropdown buttons that restyle the x, y and z data of the 3D scatter plot
ButtonsX=list()
for (i in 2:11){
  ButtonsX[[i-1]]= list(method = "restyle",
                        args = list( "x", list(olive[[i]])),
                        label = colnames(olive)[i])
}
ButtonsY=list()
for (i in 2:11){
  ButtonsY[[i-1]]= list(method = "restyle",
                        args = list( "y", list(olive[[i]])),
                        label = colnames(olive)[i])
}
ButtonsZ=list()
for (i in 2:11){
  ButtonsZ[[i-1]]= list(method = "restyle",
                        args = list( "z", list(olive[[i]])),
                        label = colnames(olive)[i])
}
olive2=olive[, 2:11]
olive2$.ID=1:nrow(olive)
d2<-SharedData$new(olive2, ~.ID, group="olive")
p3 <- plot_ly(d2, x = ~eicosenoic, y = ~linoleic, z= ~oleic, alpha = 0.8) %>%
add_markers() %>%
layout(scene = list(
xaxis=list(title=""),
yaxis=list(title=""),
zaxis=list(title="")
),
updatemenus = list(
list(y=0.9, buttons = ButtonsX),
list(y=0.7, buttons = ButtonsY),
list(y=0.5, buttons = ButtonsZ)
)
)
#show
ps<-htmltools::tagList(p1%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend(),
p2%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend(),
p3%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend()
)
htmltools::browsable(ps)
Plot 2.4.1
Plot 2.4.2
Plot 2.4.3
We use brushing to separate the three regions; see pictures 2.4.1, 2.4.2 and 2.4.3.
All the oils from region 3 have low eicosenoic and linoleic values and high oleic values. The acid values of the oils from region 1 are the opposite of those from region 3. In addition, the oils from region 2 have eicosenoic values similar to those of the oils from region 1, but lower linoleic and higher oleic values.
To sum up, the parallel coordinate plot does show the differences between oils from different regions. By choosing these variables (picture 2.4.3), we can see three different clusters quite clearly. Eicosenoic, linoleic and oleic are the three most influential variables for distinguishing the regions.
The interaction operators and operands used in step 2.4 include navigation (3D scatter plot), selection (persistent brushing of regions), connection (linked parallel coordinate, scatter and bar plots) and reconfiguring (3D scatter plot). Since the data has many features, abstraction and filtering could also be used to help compare different features.
As a strategy for determining which region an olive oil comes from, we can first measure the eicosenoic value to check whether the oil is from region 1 (North). Then we can measure the oleic value to check whether it is from region 2 (South). If the sample has a high oleic value and a low eicosenoic value, we can conclude that the oil is from region 3 (Sardinia island).
Please press the code button to see the code of this report.
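The two-step strategy above can be written as a small decision rule. The report does not state numeric cut-offs, so the function below takes them as arguments, and the values used in the example call are placeholders chosen only to make the sketch runnable. The name `classify_region` is likewise an assumption.

```r
# Hypothetical helper implementing the two-step rule described above.
# eicosenoic_cut and oleic_cut must be chosen by the analyst from the plots.
classify_region <- function(eicosenoic, oleic, eicosenoic_cut, oleic_cut) {
  ifelse(eicosenoic > eicosenoic_cut,
         1L,                                  # high eicosenoic: region 1 (North)
         ifelse(oleic > oleic_cut,
                3L,                           # low eicosenoic, high oleic: region 3 (Sardinia island)
                2L))                          # low eicosenoic, lower oleic: region 2 (South)
}

# Placeholder measurements and cut-offs, used only for illustration
predicted <- classify_region(eicosenoic = c(30, 2, 2),
                             oleic = c(7000, 7000, 7500),
                             eicosenoic_cut = 10,
                             oleic_cut = 7300)
print(predicted)
```

On the real data, the predictions could be compared with the recorded labels, for example with `table(predicted, olive$Region)`, to see how well two thresholds separate the regions.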
library(tm)
library(wordcloud)
library(RColorBrewer)
library(plotly)
library(crosstalk)
library(tidyr)
library(GGally)
set.seed(15)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, include=TRUE)
#1.1.1
data<-read.table("Five.txt",header=F, sep='\n')
data$doc_id=1:nrow(data)
colnames(data)[1]<-"text"
mycorpus <- Corpus(DataframeSource(data))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1:2)]
wordcloud(d$word,d$freq, scale=c(8,.7),min.freq=3,max.words=60, random.order=F, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
title(main = "Word Cloud for Five.txt", font.main=1.5)
#1.1.2
data<-read.table("OneTwo.txt",header=F, sep='\n')
data$doc_id=1:nrow(data)
colnames(data)[1]<-"text"
mycorpus <- Corpus(DataframeSource(data))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, stopwords("english")))
tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1:2)]
wordcloud(d$word,d$freq, scale=c(8,.7),min.freq=3,max.words=60, random.order=F, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
title(main = "Word Cloud for OneTwo.txt", font.main=1.5)
input_path <- "olive.csv"
olive <- read.csv(file = input_path)
olive$Region <- as.factor(olive$Region)
#create shared data
olive_shared <- SharedData$new(olive)
#2.1
scatter_olive <- plot_ly(olive_shared, type = "scatter", x = ~eicosenoic, y = ~linoleic)
scatter_olive
#2.2
bar_olive <-plot_ly(olive_shared, x=~Region)%>%add_histogram()%>%layout(barmode="overlay")
bscols(widths=c(2, NA),filter_slider("R", "Stearic", olive_shared, ~stearic),
subplot(scatter_olive,bar_olive)%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%hide_legend())
ggplot(olive, aes(y = linoleic, x = stearic))+
geom_point()+
geom_smooth(method = "loess")
#2.3
scatter_olive2 <- plot_ly(olive_shared, type = "scatter", x = ~arachidic, y = ~linolenic)%>%
add_markers(color = I("lightblue"))
subplot(scatter_olive,scatter_olive2)%>%
highlight(on="plotly_select", dynamic=T, persistent=T, opacityDim = I(1))%>%hide_legend()
#2.4
p<-ggparcoord(olive, columns = c(4:11))
d<-plotly_data(ggplotly(p))%>%group_by(.ID)
d1<-SharedData$new(d, ~.ID, group="olive")
p1<-plot_ly(d1, x=~variable, y=~value)%>%
add_lines(line=list(width=0.3))%>%
add_markers(marker=list(size=0.3),
text=~.ID, hoverinfo="text")
p2<-plot_ly(d1, x=~factor(Region) )%>%add_histogram()%>%layout(barmode="overlay")
# Dropdown buttons that restyle the x, y and z data of the 3D scatter plot
ButtonsX=list()
for (i in 2:11){
  ButtonsX[[i-1]]= list(method = "restyle",
                        args = list( "x", list(olive[[i]])),
                        label = colnames(olive)[i])
}
ButtonsY=list()
for (i in 2:11){
  ButtonsY[[i-1]]= list(method = "restyle",
                        args = list( "y", list(olive[[i]])),
                        label = colnames(olive)[i])
}
ButtonsZ=list()
for (i in 2:11){
  ButtonsZ[[i-1]]= list(method = "restyle",
                        args = list( "z", list(olive[[i]])),
                        label = colnames(olive)[i])
}
olive2=olive[, 2:11]
olive2$.ID=1:nrow(olive)
d2<-SharedData$new(olive2, ~.ID, group="olive")
p3 <- plot_ly(d2, x = ~eicosenoic, y = ~linoleic, z= ~oleic, alpha = 0.8) %>%
add_markers() %>%
layout(scene = list(
xaxis=list(title=""),
yaxis=list(title=""),
zaxis=list(title="")
),
updatemenus = list(
list(y=0.9, buttons = ButtonsX),
list(y=0.7, buttons = ButtonsY),
list(y=0.5, buttons = ButtonsZ)
)
)
#show
ps<-htmltools::tagList(p1%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend(),
p2%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend(),
p3%>%
highlight(on="plotly_select", dynamic=T, persistent = T, opacityDim = I(1))%>%
hide_legend()
)
htmltools::browsable(ps)